The dataset has 6110 rows and 20 columns. The target variable we want to predict is price (in US dollars per night); log_price is simply its natural logarithm, so the two carry the same information. Most of the columns are self-explanatory. A few notes about the less clear ones:
import pandas as pd
import geopandas as gpd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
import folium
Model Diagnosis: Get the best model from Q3, and analyze the patterns of its errors. Based on the analysis, propose 1-2 potential improvements to the model. (2 points)
Alternative Approach: Now let us change the target variable from price to log_price and repeat Q2 (with the same features). How would you compare the two approaches? (2 points)
data = gpd.read_file('regression_db.geojson')
data.head()
data.shape
Below is a map of circles, each representing an Airbnb listing. The area of each circle is proportional to the listing's price, and the color indicates the number of bedrooms (blue for 1 bedroom, purple for 2 bedrooms, and red for 3 or more). The map suggests that listings near the coast are generally more expensive, but they also tend to have more bedrooms.
lng_map = data['geometry'].x.mean()
lat_map = data['geometry'].y.mean()
m = folium.Map(location=[lat_map, lng_map],  # initial center coordinates of the map
               tiles='cartodbpositron',      # style of the base map
               zoom_start=12)                # initial zoom level
for index, row in data.iterrows():
    _radius = np.sqrt(row['price']) * 0.1  # area proportional to price
    if np.isnan(_radius):
        _radius = 0
    if row['bedrooms'] == 1:
        _color = '#81D8D0'  # tiffany blue
    elif row['bedrooms'] == 2:
        _color = '#800080'  # purple
    else:
        _color = '#E80018'  # target red
    lat, lon = row['geometry'].y, row['geometry'].x
    folium.CircleMarker(location=[lat, lon],
                        radius=_radius,
                        color=_color,
                        opacity=0.6).add_to(m)
m
To dig further into the implicit correlations in the Airbnb dataset, I used the heatmap function. The Correlation Between Variables Plot shows that price correlates most strongly with the accommodates, bathrooms, bedrooms, and beds variables.
plt.figure(figsize=(10,10),dpi=300)
sns.heatmap(data.corr(method="pearson", numeric_only=True),  # numeric_only skips non-numeric columns (required on pandas >= 2.0)
            cmap="RdBu", square=True,
            linewidths=.5, annot=True, fmt=".2f",
            annot_kws={"fontsize": 7}, cbar_kws={"shrink": 0.5})
plt.title("Correlation Between Variables Plot", size=15)
plt.show()
# convert the neighborhood variable to a collection of dummy variables
data2 = pd.get_dummies(data, columns=['neighborhood'], prefix = ['neighborhood'])
data2.info()
neighborhood_cols = [s for s in data2.columns if s.startswith('neighborhood_')]
pg_cols = [s for s in data2.columns if s.startswith('pg_')]
rt_cols = [s for s in data2.columns if s.startswith('rt_')]
numeric_cols = ['accommodates', 'bathrooms', 'bedrooms', 'beds', 'pool', 'd2balboa', 'coastal']
feature_names = numeric_cols + pg_cols + rt_cols + neighborhood_cols
# standardize
scaler = StandardScaler()
data2[numeric_cols] = scaler.fit_transform(data2[numeric_cols])
X = data2[feature_names]
y = data2['price']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    shuffle=True,
                                                    test_size=0.25,
                                                    random_state=123)
print(len(X_train))
print(len(X_test))
X_train
def calc_train_error(X_train, y_train, model):
    '''Return the in-sample RMSE for an already-fit model.'''
    predictions = model.predict(X_train)
    rmse = mean_squared_error(y_train, predictions, squared=False)
    return rmse

def calc_test_error(X_test, y_test, model):
    '''Return the out-of-sample RMSE for an already-fit model.'''
    predictions = model.predict(X_test)
    rmse = mean_squared_error(y_test, predictions, squared=False)
    return rmse

def calc_metrics(X_train, y_train, X_test, y_test, model):
    '''Fit the model and return the in-sample and out-of-sample RMSE.'''
    model.fit(X_train, y_train)
    train_error = calc_train_error(X_train, y_train, model)
    test_error = calc_test_error(X_test, y_test, model)
    return train_error, test_error
lr = LinearRegression(fit_intercept=True)
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_r2 = r2_score(y_test, lr_pred)
train_error, test_error = calc_metrics(X_train, y_train, X_test, y_test, lr)
train_error, test_error = round(train_error, 3), round(test_error, 3)
print('train error: {} | test error: {}'.format(train_error, test_error))
print('train/test: {}'.format(round(train_error/test_error, 3)))
print('R2 score: {}'.format(round(lr_r2, 3)))
X2 = sm.add_constant(X)
OLS_result = sm.OLS(y, X2).fit()
print(OLS_result.summary())
First, compare the prediction errors obtained when dropping each numeric variable one at a time. For the categorical variables, group the related dummy columns (e.g., all the neighborhood dummies) and drop each group as a whole, since together they represent a single high-level variable.
## iterate through the numeric features, and compare the prediction errors
for v in numeric_cols:
    sub_feature_names = list(set(feature_names) - set([v]))
    sub_train_error, sub_test_error = calc_metrics(X_train[sub_feature_names], y_train, X_test[sub_feature_names], y_test, lr)
    print('{} | test_error {} | delta_error {}'.format(v, sub_test_error, (sub_test_error - test_error)))
print('-' * 76)
## compare the prediction errors for columns related to property type
sub_feature_names = list(set(feature_names) - set(pg_cols[1:]))
sub_train_error, sub_test_error = calc_metrics(X_train[sub_feature_names], y_train, X_test[sub_feature_names], y_test, lr)
print('property types | test error {} | delta error {}'.format(sub_test_error, sub_test_error - test_error))
print('-' * 76)
## compare the prediction errors for columns related to room type
sub_feature_names = list(set(feature_names) - set(rt_cols[1:]))
sub_train_error, sub_test_error = calc_metrics(X_train[sub_feature_names], y_train, X_test[sub_feature_names], y_test, lr)
print('room types | test error {} | delta error {}'.format(sub_test_error, sub_test_error - test_error))
print('-' * 76)
## compare the prediction errors for columns related to neighborhood type
sub_feature_names = list(set(feature_names) - set(neighborhood_cols[1:]))
sub_train_error, sub_test_error = calc_metrics(X_train[sub_feature_names], y_train, X_test[sub_feature_names], y_test, lr)
print('neighborhood | test error {} | delta error {}'.format(sub_test_error, sub_test_error - test_error))
In the above evaluation, the delta error is the additional test error incurred by excluding a given variable, which measures that variable's contribution to the prediction performance. By this measure, the top 3 most important variables are:
# define the cross-validation function for ridge
def cv_ridge_alpha(X, Y, K, alphas):
    print('All errors are RMSE')
    print('-' * 76)
    cv_errors = []
    for alpha in alphas:
        errors = -cross_val_score(estimator=Ridge(alpha=alpha),
                                  X=X,
                                  y=Y,
                                  scoring='neg_root_mean_squared_error',
                                  cv=K)
        # print errors as a report
        print('alpha: {:7} | cv error (mean): {:5} | cv error (all): {}'.
              format(alpha,
                     round(np.mean(errors), 3),
                     [round(e, 3) for e in errors]))
        # store the mean error to find the best alpha
        cv_errors.append(np.mean(errors))
    print('best alpha: {}'.format(alphas[np.argmin(cv_errors)]))
    return alphas[np.argmin(cv_errors)]  # return the best alpha
# define the cross-validation function for lasso
def cv_lasso_alpha(X, Y, K, alphas):
    print('All errors are RMSE')
    print('-' * 76)
    cv_errors = []
    for alpha in alphas:
        errors = -cross_val_score(estimator=Lasso(alpha=alpha, max_iter=10000),
                                  X=X,
                                  y=Y,
                                  scoring='neg_root_mean_squared_error',
                                  cv=K)
        # print errors as a report (rounded, for consistency with the ridge report)
        print('alpha: {:7} | cv error (mean): {:5} | cv error (all): {}'.
              format(alpha,
                     round(np.mean(errors), 3),
                     [round(e, 3) for e in errors]))
        # store the mean error to find the best alpha
        cv_errors.append(np.mean(errors))
    print('best alpha: {}'.format(alphas[np.argmin(cv_errors)]))
    return alphas[np.argmin(cv_errors)]  # return the best alpha
Use cross-validation to find a good regularization hyperparameter (aka alpha in sklearn).
alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000] # lambda
ridge_alpha = cv_ridge_alpha(X_train, y_train, 5, alphas)
lasso_alpha = cv_lasso_alpha(X_train, y_train, 5, alphas)
Based on the above analysis, the best alpha for ridge is 10, and the best alpha for lasso is 1. Now let us rerun the model using the whole training data, and calculate the test errors.
ridge = Ridge(alpha=ridge_alpha)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge_r2 = r2_score(y_test, ridge_pred)
ridge_train_error, ridge_test_error = calc_metrics(X_train, y_train, X_test, y_test, ridge)
ridge_train_error, ridge_test_error = round(ridge_train_error, 3), round(ridge_test_error, 3)
lasso = Lasso(alpha=lasso_alpha)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso_r2 = r2_score(y_test, lasso_pred)
lasso_train_error, lasso_test_error = calc_metrics(X_train, y_train, X_test, y_test, lasso)
lasso_train_error, lasso_test_error = round(lasso_train_error, 3), round(lasso_test_error, 3)
print('ORIGINAL OLS ERROR')
print('-' * 50)
print('R2 score: {}'.format(round(lr_r2, 3)))
print('train error: {} | test error: {}'.format(train_error, test_error))
print('train/test: {}\n'.format(round(train_error/test_error, 3)))
print('RIDGE REGRESSION ERROR')
print('-' * 50)
print('R2 score: {}'.format(round(ridge_r2, 3)))
print('train error: {} | test error: {}'.format(ridge_train_error, ridge_test_error))
print('train/test: {}\n'.format(round(ridge_train_error/ridge_test_error, 3)))
print('LASSO REGRESSION ERROR')
print('-' * 50)
print('R2 score: {}'.format(round(lasso_r2, 3)))
print('train error: {} | test error: {}'.format(lasso_train_error, lasso_test_error))
print('train/test: {}\n'.format(round(lasso_train_error/lasso_test_error, 3)))
Based on the comparison, the ridge model is marginally better than lasso and OLS: ridge has a higher training error but a lower test error than OLS. The difference between lasso and ridge likely stems from their different regularization mechanisms (L1 vs. L2 penalties). Note that different random_state values in train_test_split may produce different results.
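The seed-sensitivity caveat can be checked directly: repeat the split under several random_state values and look at the spread of test RMSE for each model. The sketch below uses synthetic data (the real X_train/y_train are not reproduced here), so the numbers are illustrative only.

```python
# Hedged sketch: how much does the OLS-vs-Ridge comparison depend on the split?
# Synthetic regression data stands in for the Airbnb features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
coef = rng.normal(size=20)
y = X @ coef + rng.normal(scale=2.0, size=400)

ols_rmse, ridge_rmse = [], []
for seed in range(10):  # ten different train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    for model, out in ((LinearRegression(), ols_rmse), (Ridge(alpha=10), ridge_rmse)):
        model.fit(X_tr, y_tr)
        out.append(mean_squared_error(y_te, model.predict(X_te)) ** 0.5)

print('OLS   mean/std: {:.3f} / {:.3f}'.format(np.mean(ols_rmse), np.std(ols_rmse)))
print('Ridge mean/std: {:.3f} / {:.3f}'.format(np.mean(ridge_rmse), np.std(ridge_rmse)))
```

If the standard deviation across seeds is comparable to the gap between the two models, a single-split comparison should not be over-interpreted.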
The plots below show that the error distribution is skewed: the model is more likely to underpredict than overpredict, and the prediction error grows as the actual value increases. To address this, we could use a nonlinear model (e.g., random forest) or transform the target (e.g., a log transform).
test_errors = lasso_pred - y_test
histograms = plt.hist(test_errors, bins=100)
plt.xlabel('test_errors')
plt.ylabel('Frequency')
plt.plot(y_test, test_errors, 'o')
plt.xlabel('y_test')
plt.ylabel('test_errors')
Next, plot scatter diagrams of true values versus predictions for each of the four models. The plots show that Random Forest tightens the prediction scatter around the true values.
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor(n_estimators=42)
random_forest.fit(X_train, y_train)
rf_pred = random_forest.predict(X_test)
rf_r2 = r2_score(y_test, rf_pred)
rf_train_error, rf_test_error = calc_metrics(X_train, y_train, X_test, y_test, random_forest)
rf_train_error, rf_test_error = round(rf_train_error, 3), round(rf_test_error, 3)
print('R2 score: {}'.format(round(rf_r2, 3)))
print('train error: {} | test error: {}'.format(rf_train_error, rf_test_error))
print('train/test: {}'.format(round(rf_train_error/rf_test_error, 3)))
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize=(10, 10))
fig.suptitle('True Values vs Predictions', size=20)
ax1.scatter(y_test, lr_pred)
ax1.set_title('Linear Regression')
ax2.scatter(y_test, lasso_pred)
ax2.set_title('Lasso')
ax3.scatter(y_test, ridge_pred)
ax3.set_title('Ridge')
ax4.scatter(y_test, rf_pred)
ax4.set_title('Random Forest')
for ax in fig.get_axes():
ax.set(xlabel='True Values', ylabel='Predictions')
Comparing the distributions of the two variables shows that price is right-skewed, while many classical models (OLS in particular) behave better when the target is roughly normal. Taking the logarithm compresses values above the median proportionally, yielding an approximately normal distribution that can improve prediction.
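The skew-reduction claim can be quantified rather than just eyeballed from histograms. The snippet below generates a synthetic, price-like lognormal variable (the parameters are arbitrary, not fit to the Airbnb data) and compares sample skewness before and after the log transform.

```python
# Illustrative sketch on synthetic data: a right-skewed "price" variable
# becomes roughly symmetric after taking the natural log.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
prices = np.exp(rng.normal(loc=4.5, scale=0.7, size=5000))  # lognormal "prices"

print('skew(price):     {:.3f}'.format(skew(prices)))       # strongly positive
print('skew(log_price): {:.3f}'.format(skew(np.log(prices))))  # near zero
```

The same `scipy.stats.skew` check could be applied to `data['price']` and `np.log(data['price'])` to put numbers on the two panels below.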
plt.figure(figsize=(10, 5))
plt.subplot(121)
sns.distplot(data['price'], color="orange", fit=norm)  # note: distplot is deprecated in seaborn >= 0.11
plt.title("Price Distribution", size=15)
plt.subplot(122)
sns.distplot(np.log(data['price']), color="orange", fit=norm)
plt.title("Log_Price Distribution", size=15)
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)
model = LinearRegression(fit_intercept=True).fit(X_train, y_train_log)
log_pred = model.predict(X_test)
log_test_errors_exp = mean_squared_error(y_test, np.exp(log_pred), squared=False)
print('ORIGINAL OLS ERROR')
print('-' * 40)
print('test error: {}\n'.format(test_error))
print('Log OLS ERROR')
print('-' * 40)
print('test error: {}'.format(log_test_errors_exp))
By comparison, the log_price model actually gives a higher test error once its predictions are exponentiated back to dollars. One advantage of the log transformation, however, is that the prediction errors are no longer heavily skewed.
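One possible contributing factor to the higher dollar-scale error is retransformation bias: exponentiating a prediction of the conditional mean of log(y) recovers something closer to the median of y, which for a right-skewed variable is below its mean. The sketch below demonstrates this on synthetic lognormal data (parameters are arbitrary).

```python
# Sketch of retransformation bias on synthetic lognormal data:
# exp(mean(log y)) is the geometric mean, which sits below the arithmetic mean.
import numpy as np

rng = np.random.default_rng(7)
y = np.exp(rng.normal(loc=4.0, scale=0.8, size=100000))

back_transformed = np.exp(np.log(y).mean())  # what naive exp() of a log-scale fit targets
print('exp(mean(log y)): {:.1f}'.format(back_transformed))
print('mean(y):          {:.1f}'.format(y.mean()))  # systematically larger
```

Corrections such as Duan's smearing estimator exist for this, though whether one is worthwhile here would need to be checked against the actual residuals.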
histograms = plt.hist(y_test - np.exp(log_pred), bins=100)
plt.xlabel('test_errors (based on log_price)')
plt.ylabel('Frequency')